Abstract

The use of science to understand its own structure is becoming popular, but understanding the organization of knowledge 
areas is still limited because some patterns are only discoverable with proper computational treatment of large-scale 
datasets. In this paper, we introduce a framework to combine network-based methodologies and text analytics to 
construct the taxonomy of science fields. The methodology is illustrated with application to two topics: complex networks 
(CN) and photonic crystals (PC). We built citation networks using data from the Web of Science and used a community 
detection algorithm for partitioning to obtain science maps for the two topics. We also created an importance index for 
text analytics, which is employed to extract keywords that define the communities and, combined with network topology 
metrics, to generate dendrograms of relatedness among subtopics. Interesting patterns emerging from the analysis 
included identification of two well-defined communities in the PC area, which is consistent with the known existence of two 
distinct communities of researchers in the area: telecommunication engineers and physicists. With the methodology, it 
was also possible to assess the interdisciplinary nature and time evolution of subtopics defined by the keywords. The 
automatic tools described here are potentially useful not only to provide an overview of scientific areas but also to assist 
scientists in performing systematic research on a specific topic. 
1. Introduction

Recent developments in the use of machine learning methods to extract information (and
knowledge!) from Big Data have shown that machines are bound to replace humans in various
intellectual tasks in the near future, particularly in cases where a lot of information needs
to be processed (Craddock et al., 2008; Donovan, 2008; Bell et al., 2009). Clear examples of
such tasks are facial recognition (Zhao et al., 2003), establishing best routes for cars and
passengers (Laporte, 1992), internet search (Lawrence and Giles, 1998), etc. Some authors have
even been bold enough to suggest that scientific and technological development is being held
back by the limited capacity of humans, especially their memory, to process and interpret the
electronic data available (Stone and Lavine, 2014). A specific task in academic work where
this limited capacity is readily apparent is carrying out a survey of any given topic, owing
to the vast literature to be consulted. The first requirement for a survey, namely to
establish a map of knowledge (also known as science map) of the field under analysis, demands
data-intensive discovery. Surveys normally performed by humans benefit from well-founded
techniques to organize scientific literature and information, but little help exists for
understanding the knowledge structure on a larger scale. Even experienced researchers find
this hard owing to the aforementioned limited human capacity, and there is the additional
drawback of bias, even if unintentional, toward the experts' personal preferences. Not
surprisingly, modeling the knowledge structure remains an open problem in science, given the
intricate relationships among the many concepts involved.

In this paper we propose a new framework to assist humans in preparing literature surveys,
which consists of the integration of many well-established concepts arising from complex
networks (Barabasi and Albert, 1999) that have proven effective in modeling the organization
of knowledge (Boyack et al., 2005; Borner and Scharnhorst, 2009; Costa et al., 2011; Silva
et al., 2013; Boyack and Klavans, 2014). Our approach, however, distinguishes itself from
previous ones in the literature since network science and text analytics methods are
interwoven to generate science maps and taxonomies. More specifically, we build citation
networks (Chen and Hicks, 2004; Menczer, 2004; Leicht et al., 2007) that serve as the overall
framework of a science map, which needs to be complemented with a taxonomy to classify the
contents of the map. We adapted the methodologies to extract keywords to complete the science
map for two fields, namely “Complex Networks” and “Photonic Crystals”. These fields were
chosen because the authors of the paper are experts in them, which allows for a deeper
discussion of the results obtained.
2. Overview of Complex Networks and Text analytics applied to summarization

Because our study deals with two very distinct areas, namely the use of complex network
methods to analyze scientific literature and text analytics, we give here a brief overview of
previous work in each of them. This overview is by no means exhaustive, particularly as there
is a vast literature in each of these areas; we rather concentrate on work that is directly
related to the purpose of our study, which is to provide semi-automated means for assisting
authors in surveys of the literature and document summarization techniques (Silva et al.,
2011).


Recent works have used network-based metrics to characterize or quantify relevance and impact
of researchers, publications and journals (Ding et al., 2009; Yan et al., 2013; Nykl et al.,
2015; Zhou et al., 2015; McKeown et al., 2016). For instance, factor analysis was employed to
automatically extract the most important papers in citation networks (Chen, 2012).
Citation-based networks have been used in various domains, such as modeling the dynamics of
knowledge acquisition and dissemination (Borner and Scharnhorst, 2009; Amancio et al., 2012c;
Amancio, 2015a), enriching and contextualizing information of biological experiments or data
(Mullen et al., 2014), and visualizing relationships among scientific fields by constructing
science maps (Boyack et al., 2005; Leydesdorff and Rafols, 2009; Borner et al., 2012).


Of particular importance are science maps used as a versatile tool to qualitatively understand
how science fields are organized, e.g. by establishing relationships among distinct areas
(Boyack et al., 2005; Leydesdorff and Rafols, 2009; Porter and Rafols, 2009; Rosvall and
Bergstrom, 2008; Silva et al., 2010, 2011). Tools have been developed to visualize and
interact with scientific maps (Boyack et al.; van Eck and Waltman, 2010; Waaijer et al., 2011;
Silva and Costa, 2011; Silva et al., 2013; van Eck and Waltman, 2014), and to understand
interdisciplinarity (Porter and Rafols, 2009; Leydesdorff et al., 2013; Silva et al., 2013;
Lariviere et al., 2015; Leydesdorff et al., 2015) among scientific journals. In a similar
fashion, science maps can also be constructed by using self-organizing maps in which
scientific domains are mapped to a 2D space according to a neural network through a Hebbian
learning process (Skupin et al., 2013). While science maps are able to provide interesting
insights about the overall structure of science, a contextualized taxonomy of its structure is
more appropriate to the task of surveying a scientific field. This is because survey papers
are conventionally organized in a hierarchical structure, normally comprising chapters,
sections, subsections and other forms of text partitions. Establishing such a taxonomy, with
components and subcomponents hierarchically organized, is not trivial for automated tools
(Sebastiani, 2002; Silva et al., 2013), and various procedures have been adopted to classify
contents.

Text summarization is a traditional area of text analytics, which has been used to build
summaries and taxonomies of text datasets comprising many types of situations, such as tracing
the events of disasters using social media (Kedzie et al., 2015), conferences (Shen et al.,
2013) and sports events (Nichols et al., 2012). The main goal of such techniques is to obtain
an importance metric (also called salience) for terms or sentences. The summary of the content
can be constructed by rewriting the text using only terms or sentences presenting high
salience, while the taxonomies can be obtained by clustering texts according to the
similarities among their most important terms. This can be accomplished through the use of
metrics such as cosine similarity (Salton and Buckley, 1988) or semantic-wise similarities
(Boyack et al., 2011), as in relationships in WordNet or word embedding techniques (Levy and
Goldberg, 2014). A simple way to obtain the salience of terms is by comparing their relative
frequency of appearance inside a document to their frequency of appearance in a larger set of
other documents. This is usually referred to as the TF-IDF method (Salton and Buckley, 1988),
which yields good results for sets of large texts. However, the method becomes unreliable when
measuring the relevance of terms in sets of small texts, such as paper abstracts and messages
of social networking services, since terms tend to appear only a few times in each document.
Other, more complex, summarization techniques can be used to deal with this type of data.
Examples are supervised machine learning methods that require a small set of golden summaries
used to train a machine to detect important terms. Human-readable summaries may be generated
from a document or a set of documents (Radev and McKeown, 1998) by using features of low
contextual content, such as the average number of words or the number of capitalized words in
a sentence (Nenkova and McKeown, 2012).
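
To make the TF-IDF idea concrete, the sketch below scores each term by its relative frequency in a document multiplied by the logarithm of the inverse fraction of documents containing it. It is only an illustration: the function name `tf_idf`, the plain logarithmic weighting and the toy documents are assumptions, and several other TF-IDF variants exist.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus of three very short "abstracts".
docs = [
    "epidemic spreading on complex network".split(),
    "photonic crystal fiber with high birefringence".split(),
    "community detection on complex network".split(),
]
weights = tf_idf(docs)
```

In such short texts almost every term occurs once per document, so the scores are driven by the document frequency alone, which illustrates why the method is less reliable for abstracts.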


As an alternative to machine learning methods, topic analysis (Blei et al., 2003) has been
employed to find important terms (keywords) in a set of documents, such as articles or
abstracts (Griffiths and Steyvers, 2004), where terms are projected and clustered according to
their presence in a set of documents. This is done by estimating a Markov chain model of topic
information along the documents, normally obtained by Gibbs sampling. This technique has a
high computational cost, as it requires several iterations to estimate the transitions between
words, but it can give good results depending on the size of each document, the number of
documents and other properties of the dataset, as studied in depth by Tang et al. (2014).


Methods derived from network science have also been used for document summarization. The
LexRank technique (Erkan and Radev, 2004) relies upon a network of similarity between
sentences to obtain topological centrality measurements, such as eigenvector centrality
(Newman, 2010). The centrality measurements are then used to quantify the salience of terms.
In a similar fashion, word adjacency networks were employed to find keywords in a text
(Amancio et al., 2012b), where salience was obtained from the diversity measurement of nodes
(Viana et al., 2010) and provided superior results to traditional centrality measurements in
networks. This kind of analysis is advantageous over applying multiple text analytics methods
to the same dataset, since the information provided by network-based techniques does not
overlap with that provided by traditional text analytics techniques (Li et al., 2012; Amancio,
2015b; Silva and Amancio, 2012; Newman and Clauset, 2015; Amancio et al., 2012e).


3. Methodology 

A survey paper is taken here as an organized structure 
that summarizes information about a scientific field. It 
must limit the level of detail for each topic by highlighting 
the most relevant pieces of information while also reducing 
their redundancy. The hierarchy in a survey comprises 
concepts that are progressively merged together by their 
relatedness to build major contextual structures such as 
subsections and sections, as exemplified in Fig. 1. Topics
are hierarchically structured, each of which can represent a 
set of papers or other scientific works relevant to the area. 


[Figure 1 diagram: a main topic branching into sections (general topics), then specific
topics, then individual papers, with a relevance threshold marked.]


Figure 1: Example of the structure of an organized scientific survey. 
Papers are grouped into more general topics which are reflected as 
sections, subsections, chapters, etc. A threshold of relevance and 
focus is thus necessary as their content needs to be summarized and 
cannot retain the full level of detail for each paper. 


To determine the hierarchy of components and subcomponents, we first built citation networks
for the fields Complex Networks (CN) and Photonic Crystals (PC), whose papers were retrieved
from the Web of Science (WoS)[a] database using the query terms “complex network” and
“photonic crystal” (including the plural variations), respectively. For each retrieved paper,
we extracted the title, abstract, publication year, citation count and list of references. Two
citation networks were built (CN and PC) where nodes represent the papers and an edge was
established between two papers if one cites the other.

[a] http://thomsonreuters.com/thomson-reuters-web-of-science/

There are many ways to construct citation-based networks. They can be drawn directly from the
citation structure, in which two papers are connected if there is a citation between them,
resulting in an unweighted directed network. Also used in several studies are co-citation
networks (Usdiken and Pasadeos, 1995; Jenssen et al., 2001; Chen, 2004; Ding et al., 2009),
where two documents are connected if they are cited together by at least one other document.
This procedure leads to a weighted undirected network, and the number of documents citing both
can be used as a metric of similarity among documents.
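
A co-citation network can be sketched as follows, assuming the usual reading in which two documents are linked whenever some third document cites both, with the edge weight given by the number of such citing documents. The function name `cocitation_network` and the toy reference lists are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_network(references):
    """Build a weighted, undirected co-citation network.

    references maps each citing paper to the list of papers it cites.
    Returns {(a, b): weight} with a < b, where weight is the number of
    papers that cite both a and b.
    """
    weights = defaultdict(int)
    for cited in references.values():
        # Every unordered pair in one reference list is co-cited once.
        for a, b in combinations(sorted(set(cited)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical reference lists of three citing papers.
refs = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["C"]}
cocitation = cocitation_network(refs)
```

Because every reference list of length n contributes n(n-1)/2 pairs, networks built this way tend to be much denser than the direct citation network.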

For the sake of simplicity, here we opted to use traditional citation networks, but we do not
take into account the direction of citation connections. We understand that this information
is relevant in several other studies (Chen and Hicks, 2004; Menczer, 2004), but not here
because we use citation networks to represent a knowledge relationship structure which is
naturally undirected. As an alternative, we also applied the analysis presented in this work
to co-citation networks, as shown in the supplementary material, and found similar results.
However, such networks are denser and harder to discuss and visualize.

The citation networks were constructed by first obtaining the vertices from the papers
returned by the chosen queries for CN and PC in the Web of Science dataset. Next, citation
information was used to connect each citing paper to the papers it cites. Papers that were not
present in the initial queries were ignored, even if cited by others. This avoids problems
caused by dangling nodes, which can impact the topological analysis employed here, such as
community detection.
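
The construction described above can be sketched as follows. The record format (a dictionary from paper identifier to its reference list) and the name `build_citation_network` are assumptions made for illustration; the key point is that references to papers outside the query results are dropped and that edge direction is discarded.

```python
def build_citation_network(papers):
    """Build an undirected citation network restricted to retrieved papers.

    papers maps a paper id to the list of ids found in its reference list.
    Returns an adjacency dict {paper id: set of neighbouring paper ids}.
    """
    adjacency = {pid: set() for pid in papers}
    for pid, refs in papers.items():
        for ref in refs:
            # Ignore references to papers that were not returned by the query.
            if ref in adjacency and ref != pid:
                adjacency[pid].add(ref)
                adjacency[ref].add(pid)
    return adjacency


# Hypothetical records: "x9" is cited twice but was not retrieved by the query.
records = {"p1": ["p2", "x9"], "p2": ["p3"], "p3": [], "p4": ["x9"]}
network = build_citation_network(records)
```

Here `p4` ends up isolated even though it shares a reference with `p1`, since the shared reference lies outside the retrieved set.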

Citation networks can be transformed into science maps if the most relevant topics and their
inter-relationships are identified. In this study, the CN and PC citation networks were
embedded in a 3D space using a force-directed method based on the Fruchterman-Reingold
algorithm (Fruchterman and Reingold, 1991). The initial configuration had the nodes, treated
as particles, uniformly distributed over a 3D space. These nodes were allowed to interact via
repulsive forces, with attractive forces being added for the connected nodes. When the energy
of the whole system was minimized, the resulting embedding became a graphically appealing
projection of the network topology (Silva et al., 2013; Bando et al., 2013). In print, only
static 2D projections of the network can be visualized, but the network structure can be
further examined with a visualization tool (Silva et al., 2013; Bando et al., 2013). This is
important because real system topologies may exhibit very high dimension, and hence may not be
suitable for projection onto the plane (Daqing et al., 2011).
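
The force-directed embedding can be illustrated with a small sketch in the spirit of the Fruchterman-Reingold algorithm: all pairs of nodes repel, connected pairs also attract, and a decreasing "temperature" caps the displacement per iteration. The constants (ideal distance, initial temperature, cooling rate) and the name `force_layout_3d` are assumptions chosen for illustration, not the settings used in this work.

```python
import math
import random


def force_layout_3d(adjacency, iterations=200, seed=0):
    """Return {node: [x, y, z]} from a simple force-directed simulation."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # Nodes start uniformly distributed in the cube [-1, 1]^3.
    pos = {v: [rng.uniform(-1, 1) for _ in range(3)] for v in nodes}
    k = (8.0 / max(len(nodes), 1)) ** (1.0 / 3.0)  # assumed ideal distance
    temperature = 0.2                               # assumed maximum step
    for _ in range(iterations):
        disp = {v: [0.0, 0.0, 0.0] for v in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                delta = [a - b for a, b in zip(pos[u], pos[v])]
                dist = max(math.sqrt(sum(d * d for d in delta)), 1e-9)
                force = k * k / dist               # repulsion between all pairs
                if v in adjacency[u]:
                    force -= dist * dist / k       # attraction along edges
                for axis in range(3):
                    step = delta[axis] / dist * force
                    disp[u][axis] += step
                    disp[v][axis] -= step
        for v in nodes:
            length = max(math.sqrt(sum(d * d for d in disp[v])), 1e-9)
            scale = min(length, temperature) / length
            for axis in range(3):
                pos[v][axis] += disp[v][axis] * scale
        temperature *= 0.98                        # gradual cooling
    return pos


# Hypothetical toy network: two triangles joined by a single edge.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
layout = force_layout_3d(toy)
```

The pairwise loop costs O(n²) per iteration, so networks with tens of thousands of papers would in practice call for an approximate or optimized implementation.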


The main topics in a field are associated with communities in the citation networks, which
were determined by applying the multilevel community detection method (Blondel et al., 2008).
This procedure assigns each paper to exactly one community, so that communities do not
overlap. It was chosen because it allows for establishing a high modularity for the network,
while keeping the computational cost reasonable in comparison with more sophisticated methods
such as the optimum modularity (Newman, 2006). By a high modularity we mean that the
communities in the network are well distinguishable from each other. It is important to
highlight that the multilevel community detection method is stochastic; thus, for each run, a
distinct community structure can be attained for the same network. However, as discussed by
Blondel et al. (2008), the community partitionings resulting from distinct runs are very
similar among themselves and display high correspondence to those obtained by other algorithms
or expected from benchmarks.
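
As an illustration of what a modularity value measures, the sketch below evaluates the standard Newman modularity of a given partition of an undirected network, i.e. the fraction of edges inside communities minus the fraction expected if edges were placed at random with the same degrees. It does not implement the multilevel method itself; the name `modularity` and the toy network are assumptions.

```python
def modularity(adjacency, membership):
    """Newman modularity of a partition of an undirected, unweighted network.

    adjacency: {node: set of neighbours}; membership: {node: community label}.
    """
    m = sum(len(nbrs) for nbrs in adjacency.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    internal = {}  # twice the number of edges inside each community
    degree = {}    # sum of node degrees in each community
    for u, nbrs in adjacency.items():
        c = membership[u]
        degree[c] = degree.get(c, 0) + len(nbrs)
        internal[c] = internal.get(c, 0) + sum(
            1 for v in nbrs if membership[v] == c
        )
    return sum(
        internal[c] / (2 * m) - (degree[c] / (2 * m)) ** 2 for c in degree
    )


# Hypothetical toy network: two triangles joined by a single edge.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in toy}
q_split = modularity(toy, split)    # 5/14, about 0.357
q_merged = modularity(toy, merged)  # 0.0
```

Placing all nodes in one community always gives zero, while the natural split into the two triangles gives a clearly positive value.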

The relationships among communities were further examined by generating a coarse-grained graph
of the network (Rosvall and Bergstrom, 2008), in which each community was replaced by a single
community node and its connections. The edges between each pair of community nodes
$(\alpha, \beta)$ were weighted by $W_{\alpha\beta}$ according to the stochastic probability
of connections between communities $\alpha$ and $\beta$ given by:

$$W_{\alpha\beta} = \frac{E_{\alpha\beta}}{|\alpha|\,|\beta|}, \qquad (1)$$

where $E_{\alpha\beta}$ is the number of connections among nodes of communities $\alpha$ and
$\beta$, and $|\alpha|$ is the number of papers in community $\alpha$.
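
The coarse-graining step can be sketched as below, assuming that the weight between two communities is the fraction of existing inter-community edges over all possible ones, as stated in the caption of Figure 2. The function name `coarse_grain` and the toy network are hypothetical.

```python
from collections import Counter


def coarse_grain(adjacency, membership):
    """Return {(a, b): W_ab} for every pair of connected communities (a < b).

    W_ab = E_ab / (|a| * |b|), with E_ab the number of edges between the
    two communities and |a|, |b| their numbers of nodes.
    """
    sizes = Counter(membership.values())
    between = Counter()
    for u, nbrs in adjacency.items():
        for v in nbrs:
            a, b = membership[u], membership[v]
            # The test a < b counts each undirected inter-community edge once
            # and skips edges inside a community.
            if a < b:
                between[(a, b)] += 1
    return {
        pair: count / (sizes[pair[0]] * sizes[pair[1]])
        for pair, count in between.items()
    }


# Hypothetical toy network: two triangles joined by a single edge.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
weights = coarse_grain(toy, groups)
```

In this toy case one of the nine possible edges between the two triangles exists, so the single coarse-grained edge has weight 1/9.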

Since an important goal is to determine which communities are most central or peripheral in
the science map, we employed the accessibility metric (Travencolo and Costa, 2008; Travencolo
et al., 2009; Arruda et al., 2014; Amancio, 2015b), which is a local node-centered measurement
based on the heterogeneity of probabilities of reaching nodes in random walk dynamics. The
smaller the accessibility of a node, the more peripheral it is. This metric has been
successful in separating the topological center and border regions of networks while avoiding
the drawbacks of traditional measurements such as betweenness centrality.
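
A minimal sketch of an accessibility computation is given below. It assumes the common formulation in which the accessibility of a node at level h is the exponential of the Shannon entropy of the probabilities of reaching each node after h steps of a simple random walk; the cited works also consider other walk dynamics (e.g. self-avoiding walks), so this is an illustration rather than the exact procedure used here. The name `accessibility` and the star-shaped toy network are hypothetical.

```python
import math


def accessibility(adjacency, source, h):
    """Exponential of the entropy of h-step random-walk arrival probabilities."""
    probs = {source: 1.0}
    for _ in range(h):
        nxt = {}
        for node, p in probs.items():
            nbrs = adjacency[node]
            if not nbrs:
                # Assumption: a walker on an isolated node stays where it is.
                nxt[node] = nxt.get(node, 0.0) + p
                continue
            share = p / len(nbrs)
            for nb in nbrs:
                nxt[nb] = nxt.get(nb, 0.0) + share
        probs = nxt
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy)


# Hypothetical toy network: a star with centre "c" and four leaves.
star = {"c": {"a", "b", "d", "e"}, "a": {"c"}, "b": {"c"}, "d": {"c"}, "e": {"c"}}
from_centre = accessibility(star, "c", 1)  # four equally likely destinations
from_leaf = accessibility(star, "a", 1)    # a single possible destination
```

The value can be read as the effective number of nodes reachable in h steps: it equals the number of destinations when they are equally likely and decreases as the arrival probabilities become more heterogeneous.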

Ideally, the communities in the citation network should be labeled with the topics and
subtopics of a well-established taxonomy for the scientific field under analysis. However, as
already mentioned in the Introduction, there is no simple way to generate such a high-level
taxonomy automatically. Most authors have therefore resorted to extracting keywords (see
Andrade and Valencia (1998); Manning and Schütze (1999); Hulth (2003); Carretero-Campos et al.
(2013) for methods of keyword extraction), for which the majority of the methods make use of
large amounts of text. In our case, because we only considered the abstract of each paper
(representing a node in the network), we had to adapt existing methods. We devised a
measurement to quantify the importance of keywords, made of unigrams and bigrams, for each
network community. Unigrams and bigrams were extracted for each paper by analyzing its
abstract, from which stop-words were removed and the remaining words were lemmatized. This
pre-processing step is essential for the analysis because it removes words conveying little
semantic content and maps semantically related words to the same word if they share the same
canonical form (Amancio et al., 2012a,d; Amancio, 2015a). The importance index was designed to
quantify the relative frequency of a word appearing inside a community against its frequency
in the remainder of the network. First, we count the number of papers, $n_\alpha(w)$, in a
community $\alpha$ that contain a word $w$. Next, we calculate the relative in-community
frequency, $F^{\mathrm{in}}_\alpha(w)$, given by:

$$F^{\mathrm{in}}_\alpha(w) = \frac{n_\alpha(w)}{|\alpha|}, \qquad (2)$$

where $|\alpha|$ is the number of papers associated with a community $\alpha$. Analogously, we
define a relative out-community frequency:

$$F^{\mathrm{out}}_\alpha(w) = \frac{\sum_{\gamma \neq \alpha} n_\gamma(w)}{N - |\alpha|}, \qquad (3)$$

which accounts for the total relative frequency considering all communities excluding
$\alpha$, where $N$ is the total number of papers in the network. Then, we define our
measurement of importance of keywords, $I(w)$, as the highest difference between the relative
in-community and out-community frequencies of a word:

$$I(w) = \max_\alpha \left[ F^{\mathrm{in}}_\alpha(w) - F^{\mathrm{out}}_\alpha(w) \right]. \qquad (4)$$
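
The importance index of Eqs. 2 to 4 can be sketched as follows. The input format (a set of pre-processed words per paper and a community label per paper) and the name `importance_index` are assumptions; stop-word removal and lemmatization are taken as already done.

```python
def importance_index(paper_words, membership):
    """Return {word: (I(w), community attaining the maximum)}.

    paper_words: {paper: set of unigrams/bigrams found in its abstract}
    membership:  {paper: community label}
    """
    sizes = {}   # |alpha|: number of papers per community
    counts = {}  # counts[alpha][w] = n_alpha(w)
    totals = {}  # number of papers in the whole network containing w
    for paper, words in paper_words.items():
        c = membership[paper]
        sizes[c] = sizes.get(c, 0) + 1
        counts.setdefault(c, {})
        for w in set(words):
            counts[c][w] = counts[c].get(w, 0) + 1
            totals[w] = totals.get(w, 0) + 1
    n_papers = len(paper_words)
    index = {}
    for w, total in totals.items():
        best = None
        for c, size in sizes.items():
            n_in = counts[c].get(w, 0)
            f_in = n_in / size                                   # Eq. 2
            outside = n_papers - size
            f_out = (total - n_in) / outside if outside else 0.0  # Eq. 3
            diff = f_in - f_out
            if best is None or diff > best[0]:
                best = (diff, c)                                 # Eq. 4
        index[w] = best
    return index


# Hypothetical toy data: four abstracts in two communities.
words = {
    "p1": {"epidemic", "network"},
    "p2": {"epidemic", "network"},
    "p3": {"fiber", "network"},
    "p4": {"fiber"},
}
groups = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
index = importance_index(words, groups)
```

In the toy data, "epidemic" and "fiber" each occur in every paper of one community and nowhere else, so they reach the maximum value of 1, whereas "network" is spread over both communities and scores only 0.5.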


The keywords ranked according to the importance index $I(w)$ were used to create trees to
simulate the structure of a survey, as shown in Figure 1. The hierarchy tree (dendrogram) was
obtained by a hierarchical agglomerative clustering method (Duda et al., 2001; Costa and
Cesar, 2009), in which we used the average shortest path length, $\langle \ell \rangle_{uv}$,
among pairs of keywords $(u, v)$. In this procedure, we first obtained the shortest path
lengths $\ell_{ij}$ between the pairs of papers $(i, j)$ in the citation network. Next, for
each keyword pair $(u, v)$ we calculated the average of $\ell_{ij}$ among pairs of abstracts
$(A_i, A_j)$ of papers $(i, j)$, where the keywords $u$ and $v$ were respectively present.
This can also be written as:

$$\langle \ell \rangle_{uv} = \frac{\sum_{(u,v) \in (A_i \times A_j)} \ell_{ij}}{\left| (u,v) \in (A_i \times A_j) \right|}, \qquad (5)$$

where the sum and the count run over all pairs of papers $(i, j)$ such that $u$ appears in
$A_i$ and $v$ appears in $A_j$.

As a consequence, groups of keywords are progressively clustered together according to the
average topological distance between them. Therefore, our approach to generating dendrograms
incorporates concepts from both complex networks and text analytics. This was crucial because
clustering the keywords using only the abstracts would not be precise, as the amount of text
is limited.
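
The dissimilarity of Eq. 5 can be sketched with a breadth-first search over the citation network. The names `shortest_paths_from` and `keyword_distance` and the toy data are hypothetical, and skipping pairs of papers in different connected components is an assumption, since the text does not say how unreachable pairs are handled.

```python
from collections import deque


def shortest_paths_from(adjacency, source):
    """Breadth-first search: {node: hop distance from source}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def keyword_distance(adjacency, paper_words, u, v):
    """Average shortest path length between papers with u and papers with v."""
    papers_u = [p for p, words in paper_words.items() if u in words]
    papers_v = [p for p, words in paper_words.items() if v in words]
    total, pairs = 0, 0
    for i in papers_u:
        dist = shortest_paths_from(adjacency, i)
        for j in papers_v:
            # A paper containing both keywords contributes a distance of 0.
            # Assumption: unreachable pairs are left out of the average.
            if j in dist:
                total += dist[j]
                pairs += 1
    return total / pairs if pairs else float("inf")


# Hypothetical toy data: a chain of four papers p1 - p2 - p3 - p4.
chain = {"p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2", "p4"}, "p4": {"p3"}}
words = {
    "p1": {"epidemic"},
    "p2": {"epidemic", "outbreak"},
    "p3": set(),
    "p4": {"fiber"},
}
near = keyword_distance(chain, words, "epidemic", "outbreak")  # (1 + 0) / 2
far = keyword_distance(chain, words, "epidemic", "fiber")      # (3 + 2) / 2
```

The resulting pairwise values can then be passed to any hierarchical agglomerative clustering routine to obtain the dendrogram.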
Since unigrams and bigrams were ranked according to the same measurement, if a bigram has high
$I(w)$, its constituent unigrams are very likely to also feature among the top keywords. To
address this problem, we removed from the set of keywords every unigram that is part of a
bigram in the set. By doing this, we eliminate an immediate layer of redundancy among keywords
while also giving priority to more specific keywords (bigrams). For the PC field, we also
generated a dendrogram using keywords suggested by an expert, in which we omitted generic
keywords covering more than 50% of the network (e.g. photonic crystals and fiber).
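
The redundancy filter can be written in a few lines. The sketch assumes that keywords are plain strings and that a bigram is any keyword containing a space; the name `drop_redundant_unigrams` and the example list are hypothetical.

```python
def drop_redundant_unigrams(keywords):
    """Remove unigrams that occur as part of any bigram in the same list."""
    # Collect every word that appears inside a bigram.
    parts = {word for kw in keywords if " " in kw for word in kw.split()}
    # Keep all bigrams, and only those unigrams not covered by a bigram.
    return [kw for kw in keywords if " " in kw or kw not in parts]


# Hypothetical ranked keyword list.
ranked = ["photonic crystal", "crystal", "fiber", "soliton", "fiber laser"]
filtered = drop_redundant_unigrams(ranked)
# filtered == ["photonic crystal", "soliton", "fiber laser"]
```

The original ranking order is preserved, so the filtered list can be truncated directly to the desired number of top keywords.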

The temporal evolution of the fields considered was 
studied in terms of timelines for the keywords, i.e. how 
the frequency of each keyword changed over time. 
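
A keyword timeline can be sketched as below. Normalizing by the number of papers published in each year is an assumption made here so that the growth of the field itself does not dominate the curves; the name `keyword_timelines` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def keyword_timelines(paper_words, paper_year, keywords):
    """Return {keyword: {year: fraction of that year's papers containing it}}."""
    per_year = Counter(paper_year.values())
    hits = defaultdict(Counter)
    for paper, words in paper_words.items():
        year = paper_year[paper]
        for kw in keywords:
            if kw in words:
                hits[kw][year] += 1
    return {
        kw: {year: hits[kw][year] / per_year[year] for year in sorted(per_year)}
        for kw in keywords
    }


# Hypothetical toy data: three papers over two years.
words = {"p1": {"epidemic"}, "p2": {"traffic"}, "p3": {"epidemic"}}
years = {"p1": 2000, "p2": 2000, "p3": 2001}
timelines = keyword_timelines(words, years, ["epidemic", "traffic"])
```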

The proposed methodology can be summarized as follows:

1. Obtain the citation network among the papers of the corresponding dataset.

2. Obtain the words, corresponding to the n-grams present in the titles and abstracts of each
   paper. Here, we considered only unigrams and bigrams for the analysis. Also, we removed
   stop-words and the remaining words were lemmatized.

3. Apply a community detection algorithm to the network, thus obtaining a partitioning of
   papers. Here, we opted to use a fast multilevel technique (Blondel et al., 2008).

4. Calculate the in-community frequencies, $F^{\mathrm{in}}_\alpha(w)$, for each word $w$ for
   all the communities, according to Equation 2.

5. Calculate the out-community frequencies, $F^{\mathrm{out}}_\alpha(w)$, according to
   Equation 3.

6. Calculate the importance index, $I(w)$, of each word $w$ using Equation 4.

7. Sort the words according to the importance index and select an amount from the top. Here,
   we selected the first 50 keywords to match the number of keywords provided by an expert.

8. Apply a hierarchical clustering method to the selected keywords, where the dissimilarity
   between two keywords corresponds to the average topological distance between papers
   presenting such words. This procedure results in the dendrogram of keywords. The keywords
   can also be used to label the communities they belong to.

9. By using network visualization techniques, project the network to a 2D or 3D space and use
   the communities and the generated labels to obtain a scientific map (Fruchterman and
   Reingold, 1991; Silva et al., 2013; Bando et al., 2013). In this work we employed the
   Fruchterman-Reingold algorithm and, for comparison purposes, we also used the VOSviewer
   (van Eck and Waltman, 2010) visualization tool.


It should be noted that the techniques employed in each step of our framework can be replaced
by similar methods. For instance, one can use other visualization tools and techniques to
construct science maps, or one can employ other community detection algorithms. While an
extensive exploration of combinations of techniques and parameters is still needed to uncover
the benefits and disadvantages of the framework, here we illustrate it by choosing only one
set of methods and parameters. These correspond to the most traditional or simple methods
required for each step.


4. Results and Discussion 

We obtained two networks from the dataset: the CN network, comprising 11,063 papers with
average degree $\langle k_{\mathrm{out}} \rangle \approx 8.5$, and the PC network,
encompassing 20,230 papers and presenting $\langle k_{\mathrm{out}} \rangle \approx 6.6$.
Papers published from 1991 to 2013 were included in the networks. The structure of the CN
network revealed 22 communities yielding a modularity $q_{\mathrm{CN}} \approx 0.53$, while 20
communities were identified with modularity $q_{\mathrm{PC}} \approx 0.65$ for the PC network.


4.1. CN network analysis

Fig. 2(a) displays the science map from the CN citation network, where the colors denote the
communities associated with the top keywords according to the importance index of Eq. 4. As
expected from the high modularity, each module fills a distinctive region of the network
topology. The only exception appears to be communities B and D, which seem to share the same
region, but this is an artifact of the 2D projection. A clear separation is confirmed in the
3D visualization (as shown in video S1 in the supplementary material). It is interesting that
most communities originate from a dense central region of the projection, as can be observed
in the figure. This indicates that nodes in the central region are much more
interdisciplinary.

The coarse-grained graph of the CN network is shown in Fig. 2(b), which features communities
B, C and D strongly connected among themselves. Community B (epidemic spreading dynamics)
glues together many communities, being at the heart of the network alongside community H
(fractal, self-similar). This is probably because epidemic dynamics, represented by community
B, has a wide variety of applications in network science (Costa et al., 2011). In spite of
being the largest community, A (synchronization and coupling) only connects strongly to G
(brain and cortical networks), highlighting the application of synchronization dynamics to
modeling neuronal networks. Surprisingly, community E (gene regulatory networks, protein
interaction, etc.) is the least connected among the communities. Besides, it presents no
remarkable connection preference pattern, i.e. it is uniformly and weakly connected to other
communities. This indicates that papers in this community still do not fully benefit from the
tools and methodologies provided by network science.

The dendrogram obtained by clustering the top keywords, shown in Fig. 2(c), provided
interesting insights. For instance, keywords from the field of ecological applications of
complex networks, associated with papers containing the words “ecosystem”, “food web” and
“biodiversity”, are closely related among themselves. Although further investigations are
needed to explain some counterintuitive exceptions, such as the branch containing the keywords
“promote”, “player” and “animal”, on the whole the relationships between keywords are well
described by the dendrogram and appear consistent with what should be expected from an expert
in the area.

Figure 2: Projection of the CN network (a) obtained by force-directed embedding with node colors representing the communities. The
legend shows the top keywords for each community ranked according to Eq. 4. The relationships among communities obtained for the CN
network are displayed in a coarse-grained diagram (b). The diagram is obtained by collapsing each community into a single node with edges
weighted by the fraction of original edges existing against all possible edges between two communities. Edges are represented by lines with
thickness and intensity proportional to their weights. The top 50 keywords for the entire CN network are displayed in a dendrogram (c) built
with the hierarchical agglomerative clustering method applied to the topological distance between the keywords.

The analysis was complemented using the accessibility metric. The cumulative distribution of
accessibility for h = 3 taken over all nodes of the CN network is presented in Fig. 3. We
chose to calculate accessibility for level h = 3 because node-centered measurements taken
around the immediate neighborhood of a node (i.e. for h = 1 or h = 2) may depend on its degree
(Costa and Silva, 2006). Also, because the networks are small-world, the measurement may
suffer from border effects for large h. The data is grouped together by the community
membership of nodes, hence each community has a different curve of cumulative accessibility
distribution. With the data so presented it is easy to determine the percentage of nodes below
or above a certain accessibility threshold. For instance, community B possesses only roughly
10% of nodes with accessibility 1000 or lower.
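
Reading a value off the cumulative curves amounts to computing, for each community, the fraction of its nodes at or below a given accessibility. The sketch below does this for made-up values; the name `fraction_below` and all numbers in the example are hypothetical and are not taken from the CN network.

```python
def fraction_below(accessibility, membership, threshold):
    """Return {community: fraction of its nodes with accessibility <= threshold}."""
    totals, below = {}, {}
    for node, value in accessibility.items():
        c = membership[node]
        totals[c] = totals.get(c, 0) + 1
        if value <= threshold:
            below[c] = below.get(c, 0) + 1
    return {c: below.get(c, 0) / totals[c] for c in totals}


# Hypothetical accessibility values for five nodes in two communities.
values = {"n1": 500.0, "n2": 2500.0, "n3": 3000.0, "n4": 800.0, "n5": 900.0}
groups = {"n1": "B", "n2": "B", "n3": "B", "n4": "J", "n5": "J"}
shares = fraction_below(values, groups, 1000.0)
# shares == {"B": 1/3, "J": 1.0}
```

Evaluating the function over a range of thresholds yields one cumulative curve per community.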

[Figure 3 plot: cumulative distribution curves, one per
community, against accessibility (h=3) on the horizontal
axis. A strip on top of the plot orders the communities by
area under the curves, from peripheral to central:
J, I, E, G, A, F, H, C, D, B. Inset legend:]

A - synchronization, coupling, delay, lyapunov, dynamical network
B - epidemic spreading, infect, susceptible, outbreak, epidemic model
C - language, text, software system, market, software engineering
D - traffic, cascade failure, attack, congestion, load
E - gene, cell, protein interaction, regulatory, biological
F - community structure, community detection, modularity, algorithm, partition
G - brain network, functional connectivity, cortical, functional network, healthy
H - fractal, self similar, first passage, passage time, random walk
I - dilemma game, cooperation, prisoner dilemma, evolutionary, payoff
J - time series, construct, climate, visibility graph, phase space

Figure 3: Curves of cumulative distribution of accessibility obtained for the CN network communities. The curves are presented in color
according to the inset. On top of the figure the total area under the curves of each community is shown, which is related to the centrality or
peripheral nature of its nodes.

We consider peripheral those communities containing
many vertices with low accessibility. The area under the
accessibility curves can be used to rank the communities
according to their proximity to the borders of the network:
a cumulative curve that rises early, because many nodes
have low accessibility, encloses a larger area. Communities
covering a large area under the curves are therefore at the
boundaries of the network, as displayed on the top of
Fig. 3. Community J (time series, climate and visibility
graph) is the most peripheral, followed by I (game,
cooperation and prisoner dilemma) and E (protein, gene
and cell networks). In particular, community E has about
20% of papers with very low accessibility. Communities
G (brain and cortical networks), A (synchronization and
coupling) and F (community structure and community
detection) are close together and present average values of
accessibility. The curves for H (fractal, self similar and
first passage), C (language, text and software system) and
D (traffic, attack, cascade failure) also present similar
patterns of accessibility among themselves and are much
more at the core of the network than the aforementioned
communities.
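
As a minimal sketch of the ranking described above, the area under
each cumulative accessibility curve can be computed on a shared grid
and the communities sorted by it. The function names, the grid and the
toy accessibility values below are illustrative assumptions and are
not taken from the data or code used in this work.

```python
from bisect import bisect_right

def cumulative_curve(values, grid):
    """Fraction of nodes with accessibility <= g, for each g in grid."""
    ordered = sorted(values)
    return [bisect_right(ordered, g) / len(ordered) for g in grid]

def area_under_curve(values, grid):
    """Trapezoidal area under the cumulative curve over a shared grid."""
    y = cumulative_curve(values, grid)
    return sum((y[i] + y[i + 1]) * (grid[i + 1] - grid[i]) / 2.0
               for i in range(len(grid) - 1))

def rank_by_periphery(accessibility_by_community, grid):
    """Sort communities from largest area (peripheral) to smallest (central)."""
    areas = {name: area_under_curve(vals, grid)
             for name, vals in accessibility_by_community.items()}
    return sorted(areas, key=areas.get, reverse=True), areas

# Toy accessibility values (hypothetical, for illustration only).
grid = [50.0 * k for k in range(101)]  # 0 .. 5000
toy = {"J": [200, 400, 900, 1500], "B": [3000, 3500, 4200, 4800]}
order, areas = rank_by_periphery(toy, grid)
print(order)
```

All curves must be integrated over the same accessibility range,
otherwise the areas are not comparable across communities.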

Corroborating the qualitative results from the analysis
of the coarse-grained graph, the most central community
was B. The central core of the network is composed of
communities related to techniques of network dynamics
such as cascade failure, epidemic spreading dynamics and
self-similarity techniques. On the borders are found more
specific applications of networks such as cell networks,
cooperation and time series analysis.

4.2. PC network analysis 

The most striking feature of the science map represented
by the PC network is its diploid nature, with two
very distinct giant communities visualized in Figure 4(a).
From the analysis of keywords associated with these giant
communities it is readily noted that they refer to
scientists from very distinct areas. The smaller giant
community comprises papers from telecommunications, e.g. with
keywords deriving from the photonic crystal fiber topic.
Indeed, the keywords related to the communities from this
giant community are (confinement loss, long period, high
birefringence) for A, (supercontinuum generation, soliton)
for F, (fiber laser, erbium dope, dope fiber) for K and
(porous silicon, silicon photonic, monitor) for M. The
authors in this giant community are normally engineers
exploiting fibers for telecommunications. The larger giant
community is made of papers authored by experts in the
development of the science of photonic crystals, mostly
physicists. The interface between the two giant
communities is quite thin, as shown in the figure, thus
indicating little scientific interaction across the two
giant communities.

The interface between the two giant communities is
better visualized in the coarse-grained graph in Fig. 4(b),
featuring connections from nodes in communities E (one
dimensional, transfer matrix, matrix method,
omnidirectional), G (negative refraction, self collimation),
I (detection, biosensor, label free) and especially L
(vertical cavity, cavity surface, vcsel, surface emit). Also
clear from the coarse-grained graph is the difficulty in
establishing which communities are most central or
peripheral owing to the diploid nature of the network.
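
To make the notion of a coarse-grained graph concrete, the sketch
below collapses a paper-level edge list into community-level edge
weights and measures how thin the interface between two sets of
communities is. It is only an illustration under simple assumptions
(undirected edges, weights given by plain edge counts); the names
coarse_grain and interface_share and the toy data are hypothetical.

```python
from collections import Counter

def coarse_grain(edges, membership):
    """Count edges between (and within) communities.

    edges: iterable of (paper_u, paper_v) pairs, treated as undirected.
    membership: dict mapping each paper to its community label.
    Returns a Counter keyed by sorted pairs of community labels.
    """
    weights = Counter()
    for u, v in edges:
        pair = tuple(sorted((membership[u], membership[v])))
        weights[pair] += 1
    return weights

def interface_share(weights, group_a, group_b):
    """Fraction of all edges that link a community in group_a to one in group_b."""
    total = sum(weights.values())
    crossing = sum(w for (c1, c2), w in weights.items()
                   if (c1 in group_a and c2 in group_b)
                   or (c1 in group_b and c2 in group_a))
    return crossing / total if total else 0.0

# Toy citation edges and community labels (hypothetical).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
membership = {1: "A", 2: "A", 3: "F", 4: "L", 5: "B", 6: "B"}
weights = coarse_grain(edges, membership)
share = interface_share(weights, {"A", "F"}, {"B", "L"})
print(dict(weights), share)
```

A small value of the crossing fraction corresponds to the thin
interface observed between the two giant communities.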

Figure 4: Projection (a), coarse-grained diagram (b) and keywords dendrogram (c) of the PC network obtained in the same fashion as Fig. 2.

Figure 5: Cumulative accessibility distribution obtained for the PC network communities.

Figure 6: Projection (a), coarse-grained diagram (b) and dendrogram (c) with the keywords provided by an expert for the PC network.
Differently from Figs. 2 and 4, the regions depicted by colors in (a) correspond to the groups obtained after applying a threshold on the
dendrogram as indicated by a dashed red line in (c).

Here is a case where the accessibility metric is most
useful. Because it is a local measurement, it avoids the
pitfalls of global centrality measurements when used
to characterize networks presenting no well-defined border
and central regions. When applied to the PC network, the
analysis of cumulative accessibility in Figure 5 revealed
that communities K and L are those most at the borders,
followed by communities F, H and J. Communities
C and B are the most central in the network. Community
A can also be considered central within the smaller giant
community. Analogously to what was observed for the CN
network, general concepts of the PC field were found in the
core of the system, such as papers of communities B and C
comprising nodes having keywords “nanocavity”,
“quantum dot”, “waveguide”, “slow light”, etc. On the
other hand, more specific methodologies and applications
are scattered on the borders of the network, such as in
papers containing the keywords “fiber laser”, “erbium
dope”, “vertical cavity”, “transfer matrix”, “one
dimensional”, etc. The taxonomy reached by using the
automated keywords for the PC network is consistent with
expectation from experts, as indicated in the dendrogram
of Fig. 4(c).
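
The local character of accessibility can be illustrated with a small
sketch. It assumes a common form of the measurement, namely the
exponential of the entropy of the probabilities of reaching each node
after h steps, and it assumes plain random walks on an undirected
graph; the definition adopted in this work is the one given in the
methodology section and may differ in these details. The function
names are hypothetical.

```python
import math

def step(dist, adj):
    """Advance a random-walk probability distribution by one step."""
    nxt = {}
    for node, p in dist.items():
        neighbors = adj[node]
        if not neighbors:
            # Assumption: a walker on an isolated node stays in place.
            nxt[node] = nxt.get(node, 0.0) + p
            continue
        share = p / len(neighbors)
        for nb in neighbors:
            nxt[nb] = nxt.get(nb, 0.0) + share
    return nxt

def accessibility(adj, source, h=3):
    """exp(entropy) of the distribution reached from source after h steps."""
    dist = {source: 1.0}
    for _ in range(h):
        dist = step(dist, adj)
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return math.exp(entropy)

# Toy star graph: node 0 is linked to four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(accessibility(star, 0, h=1), accessibility(star, 1, h=1))
```

Only the neighborhood within h steps of a node enters the
calculation, so the value does not depend on a global notion of
network center.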

We also used a list of 67 keywords, containing up to 4
words each, provided by one of the authors (MB), an expert
in the PC field. The dendrogram in Fig. 6(c) was
constructed with the same approach as for the automated
keywords. It also provides valuable insights about the area,
such as the fact that negative refraction index is closely
related to metamaterials, which in turn are key concepts for
the technology that allows the development of an invisibility
cloak (Schurig et al. 2006; Soric et al. 2013). Another
example concerns the keyword liquid crystal, which appears,
as expected, close to photonic bandgap. A science
map of the PC network was obtained using the expert's
keywords, where partitioning was reached by applying a
threshold (as shown by the dashed line and group labels
in Fig. 6(c)) to the dendrogram. The nodes were assigned
to a community when their corresponding abstracts shared
a large number of keywords that define a specific group.
A comparison of Figs. 4(a) and 6(a) points to a narrower
coverage of nodes for the keywords suggested by the expert
for the small giant community associated with the
telecommunications area. This was indeed expected because
the expert (a physicist) has always worked with topics
akin to the large giant community and had less
familiarity with the use of photonic crystals in
telecommunications. The coarse-grained network shown in
Fig. 6(b) bears little resemblance to the one obtained from
the community analysis of the network, with the groups of
the former connecting strongly among themselves. However, a
correspondence between some of the network communities
and the groups of the expert's keyword partitioning can
be drawn by observing the communities sharing the same
regions of the network (i.e. sharing a similar set of nodes).
For instance, community G (in Fig. 4(b)) shares the same
region as the group F' (in Fig. 6(b)), also displaying
similar keywords, corresponding to subjects related to
negative refraction and cloaking. In the same fashion,
communities C and H share the same region of groups E' and G'.
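
The assignment of nodes to expert-keyword groups can be sketched as
below. The text only states that abstracts must share "a large
number" of keywords with a group, so the min_hits threshold, the
substring matching and the keyword lists in this example are
assumptions made for illustration.

```python
def assign_group(abstract, groups, min_hits=2):
    """Return the group whose keywords occur most often in the abstract.

    groups: dict mapping a group label to its list of keywords.
    Returns None when no group reaches min_hits matching keywords.
    """
    text = abstract.lower()
    hits = {name: sum(1 for kw in keywords if kw in text)
            for name, keywords in groups.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None

# Toy groups (hypothetical keyword lists, not the expert's actual list).
groups = {
    "F'": ["negative refraction", "metamaterial", "cloak"],
    "X'": ["nanocavity", "quantum dot", "slow light"],
}
first = assign_group("We study negative refraction in a metamaterial slab.", groups)
second = assign_group("An erbium-doped fiber laser.", groups)
print(first, second)
```

Abstracts that match no group remain unassigned, which is in line
with the narrower node coverage reported for the expert's keywords.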

To illustrate the possible replacement of methods in
one of the steps of our framework, we also imported the
network and labeled partitions into the VOS Viewer tool
(van Eck and Waltman 2010). This visualization software has
been used to construct scientific maps from network-based
data encompassing a diverse range of disciplines and
scientific fields. Figure 7 displays the projections attained
by the software. The existence of the two major groups in the
PC network is clearly more accentuated in the VOS Viewer
visualization than by using the force-directed method, both
in the positions (a) and in the density map (b). However,
because of the anisotropic nature of the resulting
map, some other aspects of the network structure cannot
be observed clearly. For instance, it is difficult to tell
how interconnected groups A and F are. In contrast, the
isotropic nature of the maps obtained by the force-directed
methodology reveals an informative interface between the
two groups, which is reflected more clearly by the
coarse-grained analysis. Nevertheless, the aims of the
visualization techniques are different and may highlight
distinct characteristics of the data. Perhaps the most useful
approach is to use as many suitable visualization techniques
as possible to draw better conclusions and attain deeper
understanding of the datasets and of the analysis.

Figure 7: Visualization of the PC network using the VOSViewer visualization tool (van Eck and Waltman 2010). The colors represent the
same communities displayed in Figure 4. Both the network (a) and densities (b) are shown.

The temporal evolution of the areas was examined by
considering the timeline for the keywords. We counted
the number of abstracts which contain the top keywords
obtained from the ranking index in Eq. 4. As the number
of papers may greatly vary with the years, the frequencies
were normalized by the total number of papers published
in the same year. The resulting timelines are shown in
Fig. 8. Because there are not many papers in the database
for the years before 2003 for the CN network, and 1998 for
the PC, only the subsequent years were considered.

The timelines confirm the extraordinary growth of both
CN and PC areas (as shown on top of Fig. 8), but the
growth rate decreased in the last few years. Several areas
of CN have been growing: network applications to time
series (Lacasa et al., 2008; Donner et al., 2010),
synchronization dynamics and analysis (Arenas et al., 2008),
community detection (Fortunato, 2010); while other
subtopics are shrinking, such as food web and species
networks (Dunne et al. 2002), cooperation dynamics
(Yang et al., 2009) and QSAR model (Santana et al. 2008).
In the PC field we can also observe distinct growth
patterns. The subtopics hollow core photonic, fiber laser,
erbium dope fiber, supercontinuum generation, detection,
stable and biosensor are still growing on the network, while
usage of the terms light extraction efficiency, diode led
and negative refraction is decreasing.

5. Conclusion and Future Work

The main goal of this paper was to introduce methods
that could be used to automatically construct surveys on
a given scientific field. We proposed a methodology to
simultaneously analyze contextual information (in terms
of paper abstracts) and citation networks, and this was
applied to two fields: Complex Networks and Photonic
Crystals. Upon identifying communities, it was possible
to generate a taxonomy for these fields.

Several patterns could be inferred from the results. For
complex networks, for instance, border communities were
found to be related to regulatory and protein-protein
interaction networks, in addition to subtopics related to
climate, time series and visibility graphs. The
interpretation is that these subtopics are not fully
explored, at the moment, by the many complex networks
analysis methods.

The PC network was peculiar in featuring two giant
communities, each of which could be identified by
analyzing the keywords. As expected, we found that one
giant community comprises telecommunication engineers
who use photonic crystal fibers in their applications, while
the other, larger community is composed mainly of
physicists. Surprisingly, not much interaction exists between
the two communities, and this piece of information may
be valuable to foster collaboration in the future.

The approach proposed here to construct the taxonomy
for a survey differs significantly from what exists in
the literature. Instead of using only similarities between
terms of each abstract, here a citation network was used to
provide both the distance among terms and the clustering
(derived from the community structure). In addition, a
simple text analytics technique was employed to provide
the salience of terms according to the obtained community
structure.

Here, we did not compare our results to those obtained
from traditional text analytics techniques, particularly
because the methods address two different classes of
problems. Our approach takes into consideration how, in
practice, researchers refer to other works in their fields,
which may differ significantly from the similarity of terms
obtained using only the textual content. The discrepancy
between cited works and their contextual similarity has
been a recent topic of study, with an in-depth analysis
(Amancio et al. 2012c; Ciotti et al., 2016). We understand
that the organization of the scientific community,
i.e., the citation patterns among researchers and papers,
must play an important role for constructing a survey in a
science field. In this context, our approach is more suitable
for this task than methods based solely on text similarity.

We can still compare the technical limitations of the
approach presented here and of those based on text
analytics. For instance, one of the main disadvantages of
topic analysis is the high computational cost involved in
estimating the Markov model, which requires several
iterations of Gibbs Sampling. This kind of analysis precludes
the study of bigrams and higher order n-grams, while our
approach can be extended to account for n-grams. In
addition, the limitations of such analysis are not yet
completely understood (Tang et al., 2014). Other methods
such as those based on supervised learning need the input
of an annotated corpus or sets of gold-standard summaries,
which are not commonly available in scientific datasets. We
should point out, however, that our approach is strongly
dependent on the chosen network structure. If a
co-authorship network among papers were used, instead of the
citation network, the results should be interpreted in a
different direction and could not be used, for instance, to
construct a survey. As for topic analysis, an extensive study
of the limitations of our approach is still needed to
identify its strengths and disadvantages.

Several extensions of the approach we presented can be
performed in future works. For simplicity, we did not
consider the direction of the citation networks or the
strongly asymmetric nature of the networks. These features
could play an important role in the understanding of how
distinct fields interact among themselves by citations.

In our methodology we did not take into consideration
the importance and redundancy of papers. These
limitations may be overcome by using topological
characterization at the level of papers. Future research
should also address the problem of quantifying
interdisciplinarity.

It is hoped that the approach inherent in the methods we
introduced can be applied to build new tools and assist
researchers in understanding their own or new specialty
areas.

Figure 8: Normalized frequency of occurrence for each keyword among papers published in the time period considered. The top panels
show the number of papers published in the corresponding years.
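
The normalized keyword frequencies plotted in Fig. 8 can be sketched
as follows: for each year, the number of abstracts containing a
keyword is divided by the total number of papers published in that
year, and years before a chosen starting year are discarded. The
function name, the substring matching and the toy records are
assumptions made for illustration.

```python
from collections import Counter

def keyword_timeline(papers, keyword, first_year):
    """Yearly fraction of papers whose abstract contains the keyword.

    papers: iterable of (year, abstract) pairs.
    first_year: earliest year kept (earlier years have too few papers).
    """
    totals, hits = Counter(), Counter()
    for year, abstract in papers:
        if year < first_year:
            continue
        totals[year] += 1
        if keyword in abstract.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Toy records (hypothetical, not from the Web of Science dataset).
papers = [
    (2002, "An early visibility graph study"),
    (2008, "Time series analysis via visibility graph"),
    (2008, "An epidemic model on networks"),
    (2010, "Visibility graph of climate data"),
    (2010, "A visibility graph benchmark"),
    (2010, "Synchronization of coupled oscillators"),
]
timeline = keyword_timeline(papers, "visibility graph", first_year=2003)
print(timeline)
```

Dividing by the yearly totals separates the growth of a subtopic from
the overall growth of the field shown in the top panels.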